Life Cycle of a Machine Learning Model
The Shop Customer Data is a dataset that offers insights into the ideal customers of a hypothetical shop. The shop collects customer data through membership cards, which provides a detailed picture of the customer base.
The dataset consists of 2000 records with 8 columns, each representing a specific aspect of the customer's profile. These columns include Customer ID, Gender, Age, Annual Income, Spending Score, Profession, Work Experience, and Family Size.
Analyzing this data helps businesses gain insights into customer preferences, behaviors, and purchasing habits. For example, segmentation based on age, income, or family size can reveal how these factors influence purchasing decisions.
Here's a breakdown of the key points about each column:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline
# Display all the columns of the dataframe
pd.set_option("display.max_columns", None)
# creating dataframe
df = pd.read_csv("Customers.csv")
# printing the shape of dataset
print(df.shape)
(2000, 8)
df.head()
| | CustomerID | Gender | Age | Annual Income ($) | Spending Score (1-100) | Profession | Work Experience | Family Size |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15000 | 39 | Healthcare | 1 | 4 |
| 1 | 2 | Male | 21 | 35000 | 81 | Engineer | 3 | 3 |
| 2 | 3 | Female | 20 | 86000 | 6 | Engineer | 1 | 1 |
| 3 | 4 | Female | 23 | 59000 | 77 | Lawyer | 0 | 2 |
| 4 | 5 | Female | 31 | 38000 | 40 | Entertainment | 2 | 6 |
# Let's look at the datatypes of different features
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   CustomerID              2000 non-null   int64
 1   Gender                  2000 non-null   object
 2   Age                     2000 non-null   int64
 3   Annual Income ($)       2000 non-null   int64
 4   Spending Score (1-100)  2000 non-null   int64
 5   Profession              1965 non-null   object
 6   Work Experience         2000 non-null   int64
 7   Family Size             2000 non-null   int64
dtypes: int64(6), object(2)
memory usage: 125.1+ KB
df_num=df.select_dtypes(include=np.number)
df_cat=df.select_dtypes(include='object')
print("There are",len(df_num.columns),"numerical variables in the dataset:",list(df_num.columns))
print("\n")
print("There are",len(df_cat.columns),"categorical variables in the dataset:",list(df_cat.columns))
There are 6 numerical variables in the dataset: ['CustomerID', 'Age', 'Annual Income ($)', 'Spending Score (1-100)', 'Work Experience', 'Family Size']

There are 2 categorical variables in the dataset: ['Gender', 'Profession']
df.isnull().sum()
CustomerID                 0
Gender                     0
Age                        0
Annual Income ($)          0
Spending Score (1-100)     0
Profession                35
Work Experience            0
Family Size                0
dtype: int64
(df.isnull().mean())*100
CustomerID                0.00
Gender                    0.00
Age                       0.00
Annual Income ($)         0.00
Spending Score (1-100)    0.00
Profession                1.75
Work Experience           0.00
Family Size               0.00
dtype: float64
df['Profession'].value_counts()
Artist           612
Healthcare       339
Entertainment    234
Engineer         179
Doctor           161
Executive        153
Lawyer           142
Marketing         85
Homemaker         60
Name: Profession, dtype: int64
df['Profession'].mode()[0]
'Artist'
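# Fill missing professions with the most frequent value (assign back instead of using inplace on a column)
df['Profession'] = df['Profession'].fillna(df['Profession'].mode()[0])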
df.isnull().sum()
CustomerID                0
Gender                    0
Age                       0
Annual Income ($)         0
Spending Score (1-100)    0
Profession                0
Work Experience           0
Family Size               0
dtype: int64
No null values in the dataset now!
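Mode imputation moves every missing profession into the largest category, which is why the 'Artist' count rises from 612 to 647. As a point of comparison, the sketch below contrasts mode imputation with keeping missing entries as their own category. The `toy` frame and the 'Unknown' label are assumptions made up for illustration; they are not part of the dataset or the original workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the 'Profession' column
toy = pd.DataFrame({"Profession": ["Artist", "Artist", np.nan, "Doctor", np.nan]})

# Option 1: mode imputation, the approach used in the notebook
mode_value = toy["Profession"].mode()[0]
filled_mode = toy["Profession"].fillna(mode_value)

# Option 2: keep missing entries visible under an assumed 'Unknown' label
filled_unknown = toy["Profession"].fillna("Unknown")

print(filled_mode.value_counts())
print(filled_unknown.value_counts())
```

With only 1.75% of values missing, either choice is unlikely to change the analysis much; the second option simply keeps the missingness visible.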
df.duplicated().sum()
0
df.head()
| | CustomerID | Gender | Age | Annual Income ($) | Spending Score (1-100) | Profession | Work Experience | Family Size |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15000 | 39 | Healthcare | 1 | 4 |
| 1 | 2 | Male | 21 | 35000 | 81 | Engineer | 3 | 3 |
| 2 | 3 | Female | 20 | 86000 | 6 | Engineer | 1 | 1 |
| 3 | 4 | Female | 23 | 59000 | 77 | Lawyer | 0 | 2 |
| 4 | 5 | Female | 31 | 38000 | 40 | Entertainment | 2 | 6 |
# Summary statistics (including the five-number summary) of the numerical columns
df.describe()
| | CustomerID | Age | Annual Income ($) | Spending Score (1-100) | Work Experience | Family Size |
|---|---|---|---|---|---|---|
| count | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 |
| mean | 1000.500000 | 48.960000 | 110731.821500 | 50.962500 | 4.102500 | 3.768500 |
| std | 577.494589 | 28.429747 | 45739.536688 | 27.934661 | 3.922204 | 1.970749 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 500.750000 | 25.000000 | 74572.000000 | 28.000000 | 1.000000 | 2.000000 |
| 50% | 1000.500000 | 48.000000 | 110045.000000 | 50.000000 | 3.000000 | 4.000000 |
| 75% | 1500.250000 | 73.000000 | 149092.750000 | 75.000000 | 7.000000 | 5.000000 |
| max | 2000.000000 | 99.000000 | 189974.000000 | 100.000000 | 17.000000 | 9.000000 |
df['Gender'].value_counts()
Female    1186
Male       814
Name: Gender, dtype: int64
# Create a pie chart to visualize the distribution of gender in the dataset
fig = px.pie(values=df['Gender'].value_counts(), names=df['Gender'].value_counts().index)
# Enhance the plot by adding a title and labels
fig.update_layout(title="Distribution of Gender in the Dataset")
# Create a bar chart to visualize the distribution of gender in the dataset
fig2 = px.bar(y=df['Gender'].value_counts(), x=df['Gender'].value_counts().index, color=df['Gender'].value_counts().index)
# Display the plot
fig.show()
fig2.show()
The dataset contains more female customers (1,186) than male customers (814). This imbalance could bias machine learning models trained on the dataset, especially if they are used to predict results or make decisions that could be influenced by gender.
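To put a number on the gender split, the short sketch below turns the counts reported by `value_counts()` into percentage shares and a majority-to-minority ratio. The variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the value_counts() output above
gender_counts = pd.Series({"Female": 1186, "Male": 814})

# Share of each gender as a percentage of all customers
gender_share = (gender_counts / gender_counts.sum() * 100).round(1)

# Ratio of the majority class to the minority class
imbalance_ratio = gender_counts.max() / gender_counts.min()

print(gender_share)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A split of roughly 59% to 41% is a moderate imbalance; whether it matters depends on the modeling task.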
df['Age'].value_counts()
31 31
32 30
52 30
91 29
63 28
..
42 12
10 12
77 12
71 12
98 9
Name: Age, Length: 100, dtype: int64
import plotly.express as px
fig = px.histogram(df, x="Age",
                   title='Histogram of Age',
                   labels={'Age': 'Age'},  # can specify one label per df column
                   opacity=0.8,
                   log_y=True,  # represent bars with log scale
                   color_discrete_sequence=['indianred']  # color of histogram bars
                   )
fig.show()
It is worth noting that the dataset contains an inaccuracy: the minimum age is listed as 0. This is implausible, because a newborn cannot be a shop customer holding a membership card. Further analysis is required to understand the extent of this inaccuracy.
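One way to start that further analysis is to count the rows that fail simple plausibility checks. The sketch below uses a made-up `toy` frame, and the cutoff of 18 years is an assumption chosen for illustration, not a rule stated by the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names match the dataset
toy = pd.DataFrame({
    "Age": [0, 5, 19, 34, 70],
    "Work Experience": [3, 0, 1, 10, 8],
})

MIN_PLAUSIBLE_AGE = 18  # assumed cutoff, not taken from the source

# Rows whose age falls below the assumed cutoff
too_young = toy[toy["Age"] < MIN_PLAUSIBLE_AGE]

# Rows where work experience exceeds the person's age, which is impossible
inconsistent = toy[toy["Work Experience"] > toy["Age"]]

print(len(too_young), "rows below the cutoff")
print(len(inconsistent), "rows with more experience than age")
```

Running the same two filters on `df` would show how many customer records are affected before deciding whether to drop or correct them.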
df['Annual Income ($)'].value_counts()
50000 7
9000 7
97000 6
85000 6
4000 6
..
111859 1
186655 1
164598 1
132951 1
110610 1
Name: Annual Income ($), Length: 1786, dtype: int64
# Create a histogram of the 'Age' column, and include the box plot to show the distribution
fig = px.histogram(df, x='Age', marginal='box')
# Display the plot
fig.show()
In machine learning and deep learning, a dataset that covers all age groups in a balanced way helps models perform well. The roughly even distribution of age in this dataset is therefore an advantage for building accurate, robust models that generalize to new, unseen data.
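The evenness of the age distribution can also be checked numerically by counting customers per age band. The sketch below is a minimal illustration on a hypothetical series with one customer per age from 0 to 99; the ten-year bands are an assumed choice.

```python
import pandas as pd

# Hypothetical ages: exactly one customer for each age 0-99
ages = pd.Series(range(100), name="Age")

# Ten-year bands [0, 10), [10, 20), ..., [90, 100)
bins = list(range(0, 101, 10))
age_bands = pd.cut(ages, bins=bins, right=False)

# Customers per band, in age order
band_counts = age_bands.value_counts().sort_index()

# Ratio of the fullest band to the emptiest; 1.0 means perfectly even
spread = band_counts.max() / band_counts.min()

print(band_counts)
print(f"Max/min band ratio: {spread:.2f}")
```

Applied to `df['Age']`, a ratio close to 1 would support the claim of an even distribution, while a large ratio would point to under-represented age groups.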
df['Spending Score (1-100)'].value_counts()
49 34
42 33
55 32
17 31
46 28
..
72 12
6 12
9 12
95 12
0 2
Name: Spending Score (1-100), Length: 101, dtype: int64
# Create a histogram of the 'Spending Score (1-100)' column, and include the box plot to show the distribution
fig = px.histogram(df, x='Spending Score (1-100)', marginal='box')
# Display the plot
fig.show()
df['Profession'].value_counts()
Artist           647
Healthcare       339
Entertainment    234
Engineer         179
Doctor           161
Executive        153
Lawyer           142
Marketing         85
Homemaker         60
Name: Profession, dtype: int64
## dataframe creation - for plotting
# create new pandas dataframe which contains all counts sorted by profession
profession_df = (
    df.groupby(["Profession"])
    .size()
    .reset_index(name="Counts")
    .sort_values(by=["Profession"])
)
profession_df
| | Profession | Counts |
|---|---|---|
| 0 | Artist | 647 |
| 1 | Doctor | 161 |
| 2 | Engineer | 179 |
| 3 | Entertainment | 234 |
| 4 | Executive | 153 |
| 5 | Healthcare | 339 |
| 6 | Homemaker | 60 |
| 7 | Lawyer | 142 |
| 8 | Marketing | 85 |
import plotly.graph_objects as go
# Create labels using all unique values in the column named "Profession"
labels = profession_df["Profession"].unique()
# Group by count of the "profession" column
values = profession_df["Counts"]
# Custom define a list of colors to be used for the pie chart
earth_colors = [
    "rgb(210,180,140)",
    "rgb(218,165,32)",
    "rgb(139,69,19)",
    "rgb(175, 51, 21)",
    "rgb(35, 36, 21)",
    "rgb(188,143,143)",
    "rgb(50, 205, 50)",
    "rgb(128, 128, 128)",
    "rgb(70, 130, 180)",
]
# Define the actual figure using the dimension: profession
# Note that a pull keyword was specified to explode pie pieces out of the center
fig = go.Figure(
    data=[
        go.Pie(
            labels=labels,
            values=values,
            # pull is given as a fraction of the pie radius
            pull=[0.08, 0.03, 0.07, 0.08, 0.02, 0.2, 0.05, 0.04, 0],
            # Use the earth_colors list to color individual pie pieces
            marker_colors=earth_colors,
        )
    ]
)
# Update layout to show a title
fig.update_layout(title_text="Pie chart of Profession")
# Display the figure
fig.show()
df['Work Experience'].value_counts()
1     470
0     431
8     166
9     160
7     126
4     121
6     120
5     117
10     84
2      63
3      55
12     17
13     16
14     16
11     14
15     14
16      5
17      5
Name: Work Experience, dtype: int64
# Create a bar chart to visualize the distribution of work experience in the dataset
fig = px.bar(y=df['Work Experience'].value_counts(), x=df['Work Experience'].value_counts().index, color=df['Work Experience'].value_counts().index)
# Display the plot
fig.show()
Most individuals in the dataset have little work experience: 901 of the 2,000 customers report 0 or 1 years. At the other extreme, 10 individuals have 16 or more years of work experience. These individuals likely hold senior positions in their respective companies, given their extensive experience.
Most of the remaining individuals fall within the middle range, with 4 to 10 years of work experience. This group represents a sizable portion of the dataset (894 customers) and likely includes professionals at various stages of their careers.
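The grouping described above can be reproduced from the `value_counts()` output by binning the years of experience. In the sketch below the counts are copied from that output, while the band edges (0-1, 2-3, 4-10, 11-15, 16+) are an assumed choice made for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above (years -> customers)
counts = {
    0: 431, 1: 470, 2: 63, 3: 55, 4: 121, 5: 117, 6: 120, 7: 126,
    8: 166, 9: 160, 10: 84, 11: 14, 12: 17, 13: 16, 14: 16, 15: 14,
    16: 5, 17: 5,
}
exp = pd.DataFrame({"years": list(counts.keys()), "customers": list(counts.values())})

# Assumed band edges; each interval is open on the left and closed on the right
exp["band"] = pd.cut(
    exp["years"],
    bins=[-1, 1, 3, 10, 15, 17],
    labels=["0-1", "2-3", "4-10", "11-15", "16+"],
)

# Total customers per experience band
band_totals = exp.groupby("band", observed=True)["customers"].sum()
print(band_totals)
```

The two largest bands, 0-1 years and 4-10 years, together account for almost 90% of the customers.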
df['Family Size'].value_counts()
2    361
3    311
1    299
4    289
5    258
6    243
7    234
8      4
9      1
Name: Family Size, dtype: int64
## dataframe creation - for plotting
# create new pandas dataframe which contains all counts sorted by family size
family_df = (
    df.groupby(["Family Size"])
    .size()
    .reset_index(name="Counts")
    .sort_values(by=["Family Size"])
)
family_df
| | Family Size | Counts |
|---|---|---|
| 0 | 1 | 299 |
| 1 | 2 | 361 |
| 2 | 3 | 311 |
| 3 | 4 | 289 |
| 4 | 5 | 258 |
| 5 | 6 | 243 |
| 6 | 7 | 234 |
| 7 | 8 | 4 |
| 8 | 9 | 1 |
# Create labels using all unique values in the column named "Family Size"
labels = family_df["Family Size"].unique()
# Use the per-size counts as the pie values
values = family_df["Counts"]
# Custom define a list of colors to be used for the pie chart
earth_colors = [
    "rgb(210,180,140)",
    "rgb(218,165,32)",
    "rgb(139,69,19)",
    "rgb(175, 51, 21)",
    "rgb(35, 36, 21)",
    "rgb(188,143,143)",
    "rgb(50, 205, 50)",
    "rgb(128, 128, 128)",
    "rgb(70, 130, 180)",
]
# Define the actual figure using the dimension: Family Size
# Note that a pull keyword was specified to explode pie pieces out of the center
fig = go.Figure(
    data=[
        go.Pie(
            labels=labels,
            values=values,
            # pull is given as a fraction of the pie radius
            pull=[0.08, 0.03, 0.07, 0.08, 0.02, 0.2, 0.05, 0.04, 0],
            # Use the earth_colors list to color individual pie pieces
            marker_colors=earth_colors,
        )
    ]
)
# Update layout to show a title
fig.update_layout(title_text="Pie chart of Family Size")
# Display the figure
fig.show()
# Create a box plot of Age by Gender
age_gender_boxplot = px.box(df, x='Gender', y='Age', color='Gender', title='Distribution of Age by Gender')
# Display the plot
age_gender_boxplot.show()
# Create a box plot of Annual Income by Gender
annual_income_gender_boxplot = px.box(df, x='Gender', y='Annual Income ($)', color='Gender', title='Distribution of Annual Income ($) by Gender')
# Display the plot
annual_income_gender_boxplot.show()
# Create a box plot of Spending Score by Gender
spending_score_gender_boxplot = px.box(df, x='Gender', y='Spending Score (1-100)', color='Gender', title='Distribution of Spending Score (1-100) by Gender')
# Display the plot
spending_score_gender_boxplot.show()
# Create a box plot of Work Experience by Gender
work_experience_gender_boxplot = px.box(df, x='Gender', y='Work Experience', color='Gender', title='Distribution of Work Experience by Gender')
# Display the plot
work_experience_gender_boxplot.show()
# Create box plot for Age versus Profession
fig = px.box(df, x='Age', y='Profession', color='Profession', title='Age Distribution across Professions')
# Display the plot
fig.show()
The box plot shows that the median age differs across professions. For instance, the median age for engineers is higher than for most other professions, with a large share of engineers in their 60s. Conversely, the medians for marketing professionals and homemakers sit lower, with many individuals in these professions in their 40s. This information can help in formulating marketing strategies targeted at specific age groups or professions.
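Medians read off a box plot are approximate, so it can help to compute them directly. The sketch below shows the pattern on a small made-up frame; the ages are hypothetical and only the column names match the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration; not values from the dataset
toy = pd.DataFrame({
    "Profession": ["Engineer", "Engineer", "Engineer",
                   "Marketing", "Marketing", "Homemaker"],
    "Age": [58, 62, 66, 41, 45, 44],
})

# Median age per profession, highest first
median_age = (
    toy.groupby("Profession")["Age"]
    .median()
    .sort_values(ascending=False)
)
print(median_age)
```

The same `groupby` call on `df` gives the exact medians behind the box plot.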
import plotly.graph_objects as go
# Work on a copy so the original dataframe stays untouched
data = df.copy()
fig = go.Figure(data=go.Scatter(
    x=data['Work Experience'],
    y=data['Age'],
    mode='markers',
    marker=dict(
        size=8,
        color=data['Age'],  # Color points based on age
        colorscale='Viridis',  # Choose a color scale
        showscale=True  # Display color scale
    ),
    text=data['Age'],  # Display age as hover text
    hovertemplate='Age: %{text}<br>Work Experience: %{x}',  # Customize hover text
))
fig.update_layout(
    title='Age vs Work Experience',
    xaxis_title='Work Experience',
    yaxis_title='Age',
    hovermode='closest',  # Show closest data point when hovering
)
fig.show()
fig = go.Figure(data=go.Scatter(
    x=df['Age'],
    y=df['Spending Score (1-100)'],
    mode='markers',
    marker=dict(
        size=8,
        color=df['Age'],  # Color points based on age
        colorscale='Viridis',  # Choose a color scale
        showscale=True  # Display color scale
    ),
    text=df['Age'],  # Display age as hover text
    hovertemplate='Age: %{text}<br>Spending Score: %{y}',  # Customize hover text
))
fig.update_layout(
    title='Age vs Spending Score',
    xaxis_title='Age',
    yaxis_title='Spending Score (1-100)',
    hovermode='closest',  # Show closest data point when hovering
)
fig.show()
The absence of a discernible pattern between Age and Spending Score in the dataset is intriguing. Younger individuals are commonly expected to spend more on purchases and older individuals less. However, no such relationship is evident in this dataset.
There could be several reasons for this lack of correlation. It's possible that factors other than age, such as income level, personal preferences, or individual circumstances, have a stronger influence on spending behavior in this dataset. Additionally, the dataset may not accurately represent the general population, as it could be limited to a specific demographic or a unique set of individuals with distinct spending habits.
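A scatter plot gives a visual impression only; a correlation coefficient makes the "no pattern" observation measurable. The sketch below computes the Pearson correlation on a made-up frame whose values were chosen so that the linear relationship cancels out. The data is hypothetical; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical values constructed to have no linear relationship
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60, 70],
    "Spending Score (1-100)": [50, 20, 80, 80, 20, 50],
})

# Pearson correlation: close to 0 means no linear relationship,
# close to +1 or -1 means a strong one
r = toy["Age"].corr(toy["Spending Score (1-100)"])
print(f"Pearson r: {r:.3f}")
```

On the real data, `df['Age'].corr(df['Spending Score (1-100)'])` would give the corresponding figure. A value near zero rules out a linear trend but not every possible relationship.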
fig = go.Figure(data=go.Scatter(
    x=df['Annual Income ($)'],
    y=df['Spending Score (1-100)'],
    mode='markers',
    marker=dict(
        size=8,
        color=df['Annual Income ($)'],  # Color points based on annual income
        colorscale='Viridis',  # Choose a color scale
        showscale=True  # Display color scale
    ),
    text=df['Annual Income ($)'],  # Display annual income as hover text
    hovertemplate='Annual Income: $%{text}<br>Spending Score: %{y}',  # Customize hover text
))
fig.update_layout(
    title='Annual Income vs Spending Score',
    xaxis_title='Annual Income ($)',
    yaxis_title='Spending Score (1-100)',
    hovermode='closest',  # Show closest data point when hovering
)
fig.show()
# Create a box plot for annual income grouped by profession
fig = px.box(df, y='Annual Income ($)', x='Profession', color="Profession")
# Set the title of the plot
fig.update_layout(title_text='Annual Income Distribution by Profession')
# Show the plot
fig.show()
The median income for most professions sits around 100K, while the median for homemakers is slightly lower. The marketing profession stands out: its income spread is similar to the others, but its median is higher.
fig = px.scatter(df, x='Work Experience', y='Annual Income ($)', color='Work Experience', hover_data=['Annual Income ($)'])
fig.update_layout(title='Annual Income ($) vs Work Experience', xaxis_title='Work Experience', yaxis_title='Annual Income ($)')
fig.show()
The scatter plot challenges the common belief that more years of experience should correspond to a higher annual income: it reveals no clear relationship between the two.
This finding is surprising. In the dataset, even individuals with no prior work experience (freshers) have high annual incomes, with some earning up to 189,945 USD. By contrast, the highest annual income for a person with 17 years of experience is 180,331 USD, which is lower than some freshers' incomes.
This observation suggests that other factors, such as job role, industry, education level, or negotiation skills, may have a stronger influence on annual income than years of experience alone. It highlights the complexity of the relationship between experience and income and emphasizes the importance of considering multiple factors when analyzing salary trends.
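Comparing the top earners at each experience level, as done above, is a `groupby` with a `max` aggregation; adding the median shows whether the typical income moves with experience as well. The sketch below uses made-up incomes, not the dataset's values.

```python
import pandas as pd

# Hypothetical incomes for two experience levels
toy = pd.DataFrame({
    "Work Experience": [0, 0, 0, 17, 17],
    "Annual Income ($)": [30000, 95000, 185000, 60000, 175000],
})

# Median and maximum income for each number of years of experience
summary = (
    toy.groupby("Work Experience")["Annual Income ($)"]
    .agg(["median", "max"])
)
print(summary)
```

In this toy example the maximum is higher for the inexperienced group while the median is higher for the experienced group, which shows why a single extreme value should not be read as a trend.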
fig = px.box(df, x='Profession', y='Work Experience', color='Gender',
             title='Work Experience by Profession and Gender',
             labels={'Work Experience': 'Years of Work Experience'})
fig.show()
Upon analyzing the box plot of work experience across different professions, several important findings emerge.
Professions like healthcare, executive, doctor, and marketing show a wider range of work experience than other sectors. The lawyer and entertainment sectors, however, have a relatively low median work experience of only one year.
In contrast, medians in the healthcare, executive, and doctor professions range from one to around eight years, which is more in line with expectations for these fields. Within this group, doctors sit at the low end, with a median of just two years.
Some notable outliers are observed, such as individuals with 17 years of work experience, the maximum in the dataset, in the lawyer and artist sectors.
The homemaker profession stands out with a wider range of work experience, spanning from around three to nine years. This suggests that once individuals enter this profession, they tend to stay for a longer period. Homemakers also have the highest median work experience of any profession, at around seven years.
Gender differences contribute to variations in the distribution. For example, in healthcare, the median work experience is lower for females and higher for males. One possible explanation is gender norms, such as the perception of doctors as male and nurses as female, but the data alone cannot confirm this.
In engineering, females have a higher median work experience than males. In the doctor profession, by contrast, females have a median of one year of work experience and males three years, despite a similar overall range.
For the homemaker profession, the minimum work experience is around two years for men and four years for women. However, the median work experience is the same for both genders.
These insights highlight the variations in work experience across professions and shed light on gender differences within specific fields. Understanding these patterns can be useful for making informed decisions related to career choices, workforce planning, and identifying areas for improvement.
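The per-profession, per-gender medians discussed above can be tabulated rather than read off the box plot. The sketch below is a minimal illustration using `pivot_table` on a made-up frame; the values are hypothetical and the `gap (M - F)` column is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names match the dataset
toy = pd.DataFrame({
    "Profession": ["Doctor", "Doctor", "Doctor", "Doctor",
                   "Engineer", "Engineer", "Engineer", "Engineer"],
    "Gender": ["Female", "Female", "Male", "Male",
               "Female", "Female", "Male", "Male"],
    "Work Experience": [0, 2, 2, 4, 5, 7, 1, 3],
})

# Median work experience with professions as rows and genders as columns
median_exp = toy.pivot_table(
    index="Profession",
    columns="Gender",
    values="Work Experience",
    aggfunc="median",
)

# Positive gap: males have the higher median; negative: females do
median_exp["gap (M - F)"] = median_exp["Male"] - median_exp["Female"]
print(median_exp)
```

Sorting the resulting table by the gap column would surface the professions where the difference between genders is largest.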